knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.align = "center", warning = FALSE, message = FALSE,
    tidy = TRUE, tidy.opts = list(width.cutoff = 60), R.options = list(max.print = 100))
## Introduction
library(fivethirtyeight)
library(tidyverse)
library(dplyr)
data(hate_crimes)
data(police_killings)
The two datasets chosen are “police_killings” and “hate_crimes”. police_killings contains 467 observations of 34 variables: name, age, gender, raceethnicity, month, day, year, streetaddress, city, state, latitude, longitude, state_fp, county_fp, tract_ce, geo_id, county_id, namelsad, lawenforcementagency, cause, armed, pop, share_white, share_hispanic, share_black, p_income, h_income, county_income, comp_income, county_bucket, nat_bucket, pov, urate, and college. The data record where police killed Americans in 2015. The database was compiled by The Guardian from media reports, reader submissions, and open-source efforts; The Guardian then verified this information and collected further demographic data on the victims. hate_crimes contains 51 observations of 13 variables: state, state_abbrev, median_house_inc, share_unemp_seas, share_pop_metro, share_pop_hs, share_non_citizen, share_white_poverty, gini_index, share_non_white, share_vote_trump, hate_crimes_per_100k_splc, and avg_hatecrimes_per_100k_fbi. These data were collected after the 2016 election, when hate crimes reported to the Southern Poverty Law Center jumped from 16 per day to 90 per day. The data are limited in that some hate crimes may have gone unreported. While this dataset was created to compare hate crimes to income inequality, I will look for a trend between hate crimes and police killings across race, along with analyzing income as a factor.
fulldata <- full_join(police_killings, hate_crimes, by = c(state = "state_abbrev"))
fulldata2 <- fulldata %>% select(name, age, gender, raceethnicity,
state, cause, armed, pop, county_income, median_house_inc,
hate_crimes_per_100k_splc, avg_hatecrimes_per_100k_fbi)
The data were merged with full_join in order to keep all observations from both datasets. The two datasets were joined on state abbreviations, with the key column kept as “state”. The combined dataset was named fulldata and has 517 observations and 46 variables before NAs are removed. A potential issue with combining these datasets is that not all states appear in both. Only the 12 most relevant columns were then chosen, leaving 471 observations of 12 variables.
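Before a full_join, it can be worth checking which rows will fail to match; anti_join returns them directly. A minimal sketch with hypothetical toy tibbles (the names and values below are illustrative, not from the real datasets):

```r
library(dplyr)

# Toy stand-ins for the two datasets (hypothetical rows for illustration)
killings <- tibble(state = c("TX", "CA", "ZZ"), victims = c(10, 12, 1))
crimes <- tibble(state_abbrev = c("TX", "CA", "NY"), rate = c(0.3, 0.2, 0.4))

# Rows in killings whose state has no match in crimes;
# in a full_join these rows would pick up NAs for every crimes column
unmatched <- anti_join(killings, crimes, by = c(state = "state_abbrev"))
unmatched$state  # "ZZ"
```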
## Tidying
full3 <- fulldata %>% pivot_longer(starts_with("share"), names_to = "race",
values_to = "count", values_drop_na = TRUE)
Since I was unable to find a place to use tidying in the Wrangling section, I decided to demonstrate pivot_longer by gathering the variables that start with "share" into a single race column, with their numeric values in a new column called count. However, this is not the best layout for these data because the information becomes more difficult to read.
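For reference, pivot_wider reverses this reshaping; a small sketch of the round trip on made-up values (the rows below are hypothetical, not from the merged data):

```r
library(tidyr)
library(dplyr)

# Made-up data in the long layout produced by pivot_longer above
long <- tibble(
    state = c("TX", "TX", "CA", "CA"),
    race = c("share_white", "share_black", "share_white", "share_black"),
    count = c(0.70, 0.10, 0.60, 0.05)
)

# pivot_wider spreads each race value back into its own column
wide <- long %>% pivot_wider(names_from = race, values_from = count)
wide  # columns: state, share_white, share_black
```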
## Wrangling
full2 <- fulldata %>% na.omit %>% rename(`race/ethnicity` = "raceethnicity") %>%
separate("name", into = c("first_name", "last_name"), sep = " ") %>%
select(last_name, "race/ethnicity", age, state, armed) %>%
filter(armed == "No") %>% summarise(mean(age))
full1 <- fulldata %>% na.omit %>% select(gender, raceethnicity,
armed, age, hate_crimes_per_100k_splc, median_house_inc) %>%
arrange(desc(age)) %>% mutate(hate_income = (hate_crimes_per_100k_splc/median_house_inc))
fulldata %>% group_by(age) %>% summarise(mean(pop))
## # A tibble: 61 x 2
## age `mean(pop)`
## <int> <dbl>
## 1 16 4503.
## 2 17 3149.
## 3 18 3020.
## 4 19 4258.
## 5 20 4506.
## 6 21 5079.
## 7 22 5068.
## 8 23 5907
## 9 24 4903.
## 10 25 4316.
## # … with 51 more rows
fulldata %>% group_by(gender) %>% summarise(n_distinct(cause))
## # A tibble: 3 x 2
## gender `n_distinct(cause)`
## <chr> <int>
## 1 Female 3
## 2 Male 5
## 3 <NA> 1
fulldata %>% na.omit %>% group_by(raceethnicity) %>% summarise(median(age))
## # A tibble: 5 x 2
## raceethnicity `median(age)`
## <chr> <dbl>
## 1 Asian/Pacific Islander 35
## 2 Black 32
## 3 Hispanic/Latino 31
## 4 Native American 27
## 5 White 39
fulldata %>% group_by(state) %>% summarise(n_distinct(age))
## # A tibble: 51 x 2
## state `n_distinct(age)`
## <chr> <int>
## 1 AK 1
## 2 AL 8
## 3 AR 4
## 4 AZ 20
## 5 CA 34
## 6 CO 9
## 7 CT 1
## 8 DC 1
## 9 DE 2
## 10 FL 20
## # … with 41 more rows
fulldata %>% group_by(cause) %>% summarise(n_distinct(raceethnicity))
## # A tibble: 5 x 2
## cause `n_distinct(raceethnicity)`
## <chr> <int>
## 1 Death in custody 4
## 2 Gunshot 6
## 3 Struck by vehicle 3
## 4 Taser 6
## 5 <NA> 4
fulldata %>% group_by(armed) %>% summarise(var(median_house_inc))
## # A tibble: 8 x 2
## armed `var(median_house_inc)`
## <chr> <dbl>
## 1 Disputed 163461280.
## 2 Firearm 68796358.
## 3 Knife 56715205.
## 4 No 53775562.
## 5 Non-lethal firearm 45981263.
## 6 Other 77207900.
## 7 Vehicle 52499581.
## 8 <NA> 34012412.
fulldata %>% group_by(pop) %>% summarise(median(hate_crimes_per_100k_splc))
## # A tibble: 446 x 2
## pop `median(hate_crimes_per_100k_splc)`
## <int> <dbl>
## 1 0 0.150
## 2 403 NA
## 3 678 0.110
## 4 732 0.188
## 5 1271 0.351
## 6 1325 0.185
## 7 1345 0.110
## 8 1354 0.256
## 9 1363 0.214
## 10 1392 NA
## # … with 436 more rows
fulldata %>% group_by(county_income) %>% summarise(median(share_vote_trump))
## # A tibble: 300 x 2
## county_income `median(share_vote_trump)`
## <int> <dbl>
## 1 22545 0.53
## 2 24927 0.53
## 3 25498 0.580
## 4 26877 0.63
## 5 29769 0.63
## 6 30164 0.51
## 7 30201 0.65
## 8 31163 0.69
## 9 31446 0.580
## 10 31476 0.5
## # … with 290 more rows
fulldata %>% group_by(hate_crimes_per_100k_splc) %>% summarise(max(pop))
## # A tibble: 48 x 2
## hate_crimes_per_100k_splc `max(pop)`
## <dbl> <int>
## 1 0.0674 7590
## 2 0.0691 4227
## 3 0.0783 8810
## 4 0.0954 NA
## 5 0.105 5378
## 6 0.110 5375
## 7 0.120 18168
## 8 0.124 10818
## 9 0.126 6971
## 10 0.134 6888
## # … with 38 more rows
fulldata %>% group_by(hate_crimes_per_100k_splc) %>% summarise(share_vote_trump)
## # A tibble: 471 x 2
## # Groups: hate_crimes_per_100k_splc [48]
## hate_crimes_per_100k_splc share_vote_trump
## <dbl> <dbl>
## 1 0.0674 0.580
## 2 0.0674 0.580
## 3 0.0674 0.580
## 4 0.0674 0.580
## 5 0.0674 0.580
## 6 0.0674 0.580
## 7 0.0691 0.6
## 8 0.0691 0.6
## 9 0.0691 0.6
## 10 0.0691 0.6
## # … with 461 more rows
The dataset “full2” renames the original variable “raceethnicity” to “race/ethnicity” so it reads more clearly in the full dataset. Next, name was split into first and last names so the data could be examined by last name instead. To look only at people who were not armed, I filtered for armed == "No" and summarized to find the mean age of unarmed people killed by police, which was 39. The new dataset “full1” looks specifically at six columns: gender, race/ethnicity, whether the victim was armed, age, hate crimes within the area, and median household income. The data were arranged from oldest to youngest victim, and then mutate was used to compute hate crimes relative to household income (hate_income). The data showed that the oldest person killed was an 87-year-old white male who was armed. He also came from an area with more hate crimes relative to the median income of his state. Summary statistics were then computed across various variables using different groupings and summary functions. The three most interesting results are discussed here: grouping by the race/ethnicity of the victims, by their ages, and by recorded hate crimes. Grouped by race/ethnicity, the median age was computed; the youngest group was Native Americans at 27, while the oldest group was white victims at 39 years old. Grouped by victim age, 16-year-old victims were found in an average population size of 4503. Grouped by recorded hate crimes, summary stats were run on the share of Trump voters by state. The data showed hate crimes increasing with increased voting for Trump; however, there are too many missing values within hate crimes to form valid conclusions.
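Because hate_crimes_per_100k_splc contains missing values, group summaries can collapse to NA unless na.rm = TRUE is passed to the summary function. A quick sketch on made-up data (the group labels and rates below are hypothetical):

```r
library(dplyr)

# Made-up rates with an NA, mimicking the missing hate-crime values
df <- tibble(
    group = c("high", "high", "low", "low"),
    rate = c(0.30, NA, 0.20, 0.25)
)

# Without na.rm = TRUE the "high" group's mean would be NA;
# with it, the NA is dropped and the mean is computed from 0.30 alone
df %>% group_by(group) %>% summarise(mean_rate = mean(rate, na.rm = TRUE))
```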
## Visualizing
fulldata2 %>% na.omit %>% select_if(is.numeric) %>% cor %>% as.data.frame %>%
rownames_to_column %>% pivot_longer(-1) %>% ggplot(aes(rowname,
name, fill = value)) + geom_tile() + geom_text(aes(label = round(value,
2))) + xlab("") + ylab("") + coord_fixed() + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
ggplot(fulldata, aes(hate_crimes_per_100k_splc, median_house_inc,
color = state)) + geom_point(size = 3) + geom_line(aes(group = state)) +
theme_classic() + theme(legend.position = "right") + ggtitle("Hate Crimes Based on Household Income") +
ylab("Median Income") + xlab("Hate Crimes") + scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 10) + stat_summary(fun = mean,
geom = "line")
fulldata %>% na.omit %>% ggplot(aes(pop, age, group = raceethnicity,
fill = raceethnicity)) + geom_boxplot() + theme(legend.position = "right") +
ggtitle("Victim Ages Based on Race/Ethnicity")
In the correlation map, the strongest correlation is between median household income and county income, at 0.56. This could be a strong correlation because both variables measure income within a state. The weakest correlation is between population and median household income, at -0.01. Since population is not averaged while household income is, that could be one reason why these variables are the least related. The scatterplot illustrates the relationship between hate crimes and median household income. Based on the points, there seems to be a positive correlation between hate crimes and median income. This relationship differs from the expected hypothesis of increased hate crimes with increased poverty. Each point is colored by state; however, there is no obvious trend between states' locations and their hate crimes or median income. The boxplot compares population size to the ages of people killed by police across race/ethnicity groups. As expected, the group identifying as white is the largest, consistent with national demographics. The smallest apparent boxplot is Native American, which also corresponds with race proportions within the US. The boxplots show similar age ranges across the different races; however, the Black and white groups have the most outliers at older ages.
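A single correlation like the 0.56 above can also be computed directly without dropping whole rows, by using pairwise-complete observations; a sketch on made-up income vectors (the values are illustrative, not from the dataset):

```r
# Made-up income values; one entry is missing
house_inc <- c(50000, 60000, NA, 55000)
county_inc <- c(52000, 61000, 58000, 56000)

# "pairwise.complete.obs" ignores only the pairs containing an NA,
# rather than discarding every row with any missing value (as na.omit does)
cor(house_inc, county_inc, use = "pairwise.complete.obs")
```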
## Dimensionality Reduction
library(cluster)
library(plotly)
fulldata3 <- fulldata %>% na.omit
fulldata4 <- fulldata3 %>% dplyr::select(share_vote_trump, share_non_white,
share_white_poverty, hate_crimes_per_100k_splc)
sil_width <- vector()
for (i in 2:10) {
    pam_fit <- pam(fulldata4, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
kclust <- fulldata4 %>% kmeans(5)
kclust
## K-means clustering with 5 clusters of sizes 27, 62, 89, 180, 42
##
## Cluster means:
## share_vote_trump share_non_white share_white_poverty
## 1 0.3837037 0.2759259 0.08000000
## 2 0.4167742 0.3906452 0.07935484
## 3 0.5866292 0.2950562 0.10011236
## 4 0.4329444 0.5471667 0.08961111
## 5 0.5361905 0.1945238 0.10928571
## hate_crimes_per_100k_splc
## 1 0.7252740
## 2 0.3181779
## 3 0.1497946
## 4 0.2159560
## 5 0.3188548
##
## Clustering vector:
## [1] 3 3 5 4 3 4 4 4 4 5 4 4 5 5 3 4 2 4 4 4 3 3 4 3 1 5 4 4 4 4 1 3 3 4 5 4 4
## [38] 4 1 3 4 3 4 5 4 4 4 2 4 1 4 3 4 3 4 1 2 4 3 1 5 4 4 4 4 5 4 4 4 4 3 5 3 1
## [75] 4 2 5 3 4 3 4 4 2 2 4 4 1 4 3 4 4 2 3 2 2 3 4 2 4 4
## [ reached getOption("max.print") -- omitted 300 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 1.1812805 0.7007365 0.9437165 2.5250683 0.6085452
## (between_SS / total_SS = 73.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
pam <- fulldata4 %>% pam(k = 5)
pam
## Medoids:
## ID share_vote_trump share_non_white share_white_poverty
## [1,] 396 0.63 0.35 0.12
## [2,] 330 0.49 0.24 0.09
## [3,] 400 0.33 0.61 0.09
## [4,] 6 0.50 0.49 0.09
## [5,] 398 0.38 0.31 0.09
## hate_crimes_per_100k_splc
## [1,] 0.1258389
## [2,] 0.2851011
## [3,] 0.2558054
## [4,] 0.2253200
## [5,] 0.6774876
## Clustering vector:
## [1] 1 1 2 3 2 4 3 3 4 2 3 3 2 2 1 4 2 3 3 3 1 1 3 1 5 2 3 4 4 3 5 1 1 4 2 4 4
## [38] 3 5 1 4 1 4 2 3 4 4 4 3 5 4 2 4 1 3 5 2 3 1 5 2 4 3 4 4 2 4 4 3 3 1 2 1 5
## [75] 4 2 2 2 4 2 4 4 3 4 4 4 5 4 1 4 4 3 1 3 4 1 4 2 4 3
## [ reached getOption("max.print") -- omitted 300 entries ]
## Objective function:
## build swap
## 0.08522959 0.08457041
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
plot(pam, which = 2)
pam5 <- fulldata3 %>% mutate(cluster = as.factor(pam$clustering))
pam5 %>% plot_ly(x = ~share_vote_trump, y = ~share_non_white,
z = ~hate_crimes_per_100k_splc, color = ~cluster, type = "scatter3d",
mode = "markers", symbol = ~raceethnicity)
Since the clustered variables were all collected as fractions, they were already on similar scales. The cluster sizes were 27, 62, 89, 180, and 42. Five clusters were used because, among k = 2 through 6, k = 5 gave the highest between-cluster sum of squares (73.7% of the total). While the values for k = 7 through 10 kept increasing, I chose 5 in order to keep a small number of clusters while still having a decent sum of squares. The average silhouette width of 0.47 indicates a weak clustering structure. The 3D graph shows the clusters so the relationships among the four variables are easier to see: the darker green points sit toward the bottom while the orange points sit toward the top of the graph, but with a smaller share of the non-white population.
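Had the variables been on very different scales (say, raw incomes alongside fractions), standardizing before clustering would matter, since k-means distances are dominated by the largest-scale variable. A minimal sketch with simulated data (the columns a and b are invented for illustration):

```r
set.seed(1)

# Simulated columns on very different scales
toy <- data.frame(a = runif(20), b = runif(20) * 1000)

# scale() centers each column to mean 0 and rescales it to sd 1,
# so no single variable dominates the Euclidean distances
scaled <- scale(toy)
km <- kmeans(scaled, centers = 2)
km$size  # two cluster sizes that sum to 20
```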